handle unresponsive nodes #474
Conversation
Force-pushed from 50f1429 to 2909126
Force-pushed from 2909126 to be070fb
Ooh very cool to see progress on this issue. My main comment is structural -- I think we should try to remove the global reasoning about nodes from the per-node reconciliation. The least invasive possibility I can think of is to pull the egress gateway labeling into its own control loop, perhaps worth a shot!
eip_operator/src/controller/node.rs
Outdated
```rust
// The EIP will be disassociated and then this apply() will exit early since no EIP is associated
warn!("Node {} is in Unknown state, disassociating EIP", &name);
crate::aws::disassociate_eip(&self.ec2_client, &eip_name).await?;
crate::eip::set_status_detached(&eip_api, &eip_name).await?;
```
I believe this is race-y... many `apply` fns for different nodes can run concurrently by default, and this `set_status` operation doesn't include Kube's `resourceVersion` to perform optimistic locking. As a result, you could have:
- Node A has the EIP and is marked Unknown
- Node B spins up and the reconciler runs `set_status_should_attach` to attach it to B
- The reconciler for Node A runs `set_status_detached` and removes it from B
- Both reconcilers finish and the EIP is not marked as attached to either node 😱
A Kube `replace_status` would add in some optimistic locking so the two reconcilers cannot overwrite each other, though I haven't thought much about how hard the error cases would be to reason about.
This is now in place per "use replace_status on set_status_detached".
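For context, a minimal sketch of what an optimistic-locking `set_status_detached` could look like with kube-rs. The `Eip` CRD definition here is a stand-in (the group, fields, and status shape are assumptions for illustration, not the operator's actual types):

```rust
use kube::{api::{Api, PostParams}, CustomResource};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Stand-in for the operator's Eip CRD; group/version and fields are illustrative.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "example.dev", version = "v1", kind = "Eip", status = "EipStatus")]
pub struct EipSpec {}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct EipStatus {
    pub resource_id: Option<String>,
}

async fn set_status_detached(eip_api: &Api<Eip>, name: &str) -> anyhow::Result<()> {
    // Fetch the live object so the write carries its current resourceVersion.
    let mut eip = eip_api.get(name).await?;
    eip.status = Some(EipStatus::default());

    // replace_status is an optimistic-concurrency write: if another reconciler
    // modified this Eip after our get(), the resourceVersion no longer matches
    // and the API server rejects the write with 409 Conflict instead of
    // silently overwriting, so the losing reconciler retries on fresh state.
    eip_api
        .replace_status(name, &PostParams::default(), serde_json::to_vec(&eip)?)
        .await?;
    Ok(())
}
```

A conflict surfaces as a `kube::Error::Api` with code 409, which the controller's normal requeue/backoff machinery can treat as a retry.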
eip_operator/src/controller/node.rs
Outdated
```rust
        return Ok(None);
    }
    _ => return Ok(None),
```
Do we not need to do anything for the other `(true, true, "Unknown") | (true, false, "Unknown") | (false, true, "Unknown")` cases?
Worked with Justin on the cases and refactoring of the node controller. I think this is simplified / covered now.
eip_operator/src/controller/node.rs
Outdated
```rust
// This code path runs a limited number of times once a node goes unresponsive.
// The EIP will be disassociated and then this apply() will exit early since no EIP is associated
warn!("Node {} is in Unknown state, disassociating EIP", &name);
crate::aws::disassociate_eip(&self.ec2_client, &eip_name).await?;
```
If for any reason the `eip-operator` crashed between executing `disassociate_eip` and `set_status_detached`, would we be able to reconcile and clear the status correctly?
@pH14 Good thought! We should probably remove the `resource-id` label first! This would allow the EIP to be reclaimed by another node (or potentially the same node).
We might also want to swap the order in the node controller.
In the node controller, if the node is ready, it should check whether it already has an EIP claimed and ensure it's attached to that EIP. If it's not attached to an EIP, it should attempt to find one and claim it, then attempt to associate it and update the other status fields. (This can probably be done in a second PR, but doing it this way may remove the case where we can't fix the resource ID of the EIP if the label got removed while the EIP is still associated.)
I like this idea! Leaving for a separate PR as discussed.
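To make the proposed ordering concrete, here is a rough sketch of a ready-node reconcile under that scheme. Every helper named here (`find_claimed_eip`, `find_unclaimed_eip`, `claim_eip`, `associate_and_update_status`) is hypothetical, and the `Eip` type is the stand-in from the earlier sketch:

```rust
use k8s_openapi::api::core::v1::Node;
use kube::Api;

// Hypothetical reconcile order for a ready node, per the discussion above:
// record the claim durably in Kubernetes first, only then touch AWS, so a
// crash between the two steps leaves a claim the next reconcile can finish.
async fn reconcile_ready_node(node: &Node, eip_api: &Api<Eip>) -> anyhow::Result<()> {
    // 1. Prefer an EIP this node has already claimed.
    let eip = match find_claimed_eip(eip_api, node).await? {
        Some(eip) => eip,
        None => {
            // 2. Otherwise find an unclaimed EIP and record the claim first.
            match find_unclaimed_eip(eip_api, node).await? {
                Some(eip) => claim_eip(eip_api, &eip, node).await?,
                None => return Ok(()), // nothing available to attach
            }
        }
    };
    // 3. Associate at AWS and fill in the remaining status fields.
    associate_and_update_status(&eip, node).await?;
    Ok(())
}
```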
Feels like this is getting close! Left a few more comments, but the structure here feels much easier to grok.
eip_operator/src/main.rs
Outdated
```rust
kube_runtime::watcher::Config::default().labels(EGRESS_GATEWAY_NODE_SELECTOR_LABEL_KEY);

task::spawn(async move {
    let mut watcher = pin!(kube_runtime::watcher(node_api.clone(), watch_config));
```
Hm, I'd think we'd want to watch `Eip` resources rather than `Node`, since we set the Eip's status each time the Eip is associated or disassociated. It's not clear to me that this fires if we associate an Eip right now.
Separately, this watcher is also self-triggering, where setting the egress gateway status will trigger another reconciliation. Not a correctness issue if it converges, but it could create some read/write amplification or hard-to-debug issues if it does anything unexpected.
Good find! The watcher is now correctly configured on Eip resources.
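A sketch of what the corrected wiring might look like, reusing the `Eip` stand-in from the earlier sketch. `label_egress_nodes` is an assumed helper, and the `watcher::Event` variant names match older (pre-0.86) kube_runtime releases:

```rust
use std::pin::pin;

use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Node;
use kube::Api;
use kube_runtime::watcher;
use tokio::task;

// Watch Eip resources (not Nodes), so the labeler re-runs whenever an Eip's
// association status changes.
fn spawn_egress_labeler(eip_api: Api<Eip>, node_api: Api<Node>) {
    task::spawn(async move {
        let mut events = pin!(watcher(eip_api, watcher::Config::default()));
        // Exits on a watch error; production code would log and restart.
        while let Ok(Some(event)) = events.try_next().await {
            // React to added/updated Eips. Node label writes don't produce
            // Eip events, so this loop no longer triggers itself.
            if let watcher::Event::Applied(_eip) = event {
                if let Err(e) = label_egress_nodes(&node_api).await {
                    tracing::error!("egress labeling failed: {e}");
                }
            }
        }
    });
}
```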
* cleanup egress_ready label
* only detach don't disassociate EIP on unresponsive nodes
LGTM! This really feels like you got to the essence of the problem here, the control loops feel very well-scoped, and the `label_egress_nodes` is super clean.
README.md
Outdated
```diff
@@ -1,3 +1,4 @@
+⚠️⚠️ **WARNING!** ⚠️⚠️ The Materialize K8s-eip-operator will soon be archived and no longer be under active development.
```
Probably enough to say that it "will soon be archived" or "will soon be archived and is no longer under active development."
Cleaned up the wording in node improvements. I left in the warning portion since I found that practice used in another MZ public archive.
```diff
-eip.meta_mut().resource_version = eip_v1.metadata.resource_version.clone();
+let resource_version = eip_v1.metadata.resource_version.clone();
+eip.meta_mut().resource_version = resource_version;
```
Why is this changing here?
After updating Rust I ran into a clippy error. This change fixes: `assigning the result of Clone::clone() may be inefficient`.
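For reference, that message comes from the `clippy::assigning_clones` lint, whose stock suggestion is `clone_from`, which can reuse the destination's existing allocation. A tiny standalone illustration (not the operator's code):

```rust
fn main() {
    let src = String::from("resource-version-409518");
    let mut dst = String::with_capacity(32);

    // Clippy's `assigning_clones` lint flags this pattern:
    //     dst = src.clone();
    // because it always allocates a fresh String.

    // clone_from may reuse dst's existing buffer instead.
    dst.clone_from(&src);
    println!("{dst}");
}
```

The binding-based fix in the diff above sidesteps the lint differently (the assignment's right-hand side is a moved binding rather than a `clone()` call); either form satisfies clippy.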
eip_operator/src/egress.rs
Outdated
```rust
// However, unresponsive nodes may still exist with a ready-status egress label
// set to "true". This allows the old node to still serve traffic if possible until a
// new node is ready to take over.
```
I don't think this agrees with
```rust
// We immediately disassociate EIPs from egress nodes that are in an
// unresponsive state in the eip node controller.
```
If the EIP has been disassociated from a node, it should not be serving traffic; we shouldn't wait to remove the status.
I'm not saying we want to do this, but if this is the behavior we're looking for, I think we'd have to do something like adding a label to the EIP that says it can be reclaimed (or removing the eip resource id), but not actually detach it when a node becomes unresponsive.
Discussed over slack and covered by the egress labeling and node controller refactors. We agreed to set the status of the EIP to detached to free the claim. Egress will continue to attempt to go out the unresponsive node on the EIP as it will still be attached to the instance.
eip_operator/src/egress.rs
Outdated
```rust
    eip_api: &Api<Eip>,
    node_api: &Api<Node>,
) -> Result<(), Error> {
    let egress_nodes = get_egress_nodes(&Api::all(node_api.clone().into())).await?;
```
We probably don't want to get all egress nodes, just the nodes that would match the EIP being reconciled. I think the following should work:
```rust
let egress_nodes = node_api.list(&eip.get_resource_list_params()).await?.items;
```
This will break if the eip is a pod-filtering eip; we could return early if we see that.
I didn't realize this function existed - much cleaner! Covered in egress improvements.
eip_operator/src/egress.rs
Outdated
```rust
    .into_iter()
    .filter_map(|eip| eip.status.and_then(|s| s.resource_id))
    .collect();
```
If we want to return early when there isn't a ready attachment, we can return early here when there isn't an attachment at all. Then I think we can just check whether the attachment for the current eip is ready: if so, we ensure it's marked as egress_ready and everything else is marked as egress-unready; if it's not ready, we do nothing.
This is covered by the refactoring of the egress labeler.
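A condensed sketch of that early-return flow, reusing the `Eip` stand-in from the earlier sketch; `node_is_ready` and `add_gateway_status_label` are assumed helpers:

```rust
use k8s_openapi::api::core::v1::Node;
use kube::{Api, ResourceExt};

// Early-return shape for the labeler: no attachment or an unready attachment
// means do nothing; a ready attachment flips exactly one node to egress-ready.
async fn label_egress_nodes(eip: &Eip, node_api: &Api<Node>) -> anyhow::Result<()> {
    // No attachment on this Eip -> nothing to do.
    let Some(resource_id) = eip.status.as_ref().and_then(|s| s.resource_id.clone()) else {
        return Ok(());
    };

    let nodes = node_api.list(&Default::default()).await?.items;

    // Attachment exists but the attached node is missing or unready -> do
    // nothing, leaving the old labels (and any still-working path) in place.
    let Some(attached) = nodes.iter().find(|n| n.name_unchecked() == resource_id) else {
        return Ok(());
    };
    if !node_is_ready(attached) {
        return Ok(());
    }

    // Mark the attached node ready and every other node unready.
    for node in &nodes {
        let ready = node.name_unchecked() == resource_id;
        add_gateway_status_label(
            node_api,
            &node.name_unchecked(),
            if ready { "true" } else { "false" },
        )
        .await?;
    }
    Ok(())
}
```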
eip_operator/src/egress.rs
Outdated
```rust
// Note(Evan): find nodes that match eips we're reconciling
// if eip has a resource id, see if the node with the resoruce is ready
// if no, do nothing
// if yes, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false
```
```diff
-// Note(Evan): find nodes that match eips we're reconciling
-// if eip has a resource id, see if the node with the resoruce is ready
-// if no, do nothing
-// if yes, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false
+// If the EIP being reconciled has a resource id, find its node, and check if it's ready.
+// If not ready, return early.
+// If ready, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false.
```
Makes sense. Covered in egress improvements
eip_operator/src/egress.rs
Outdated
```rust
for other_node in egress_nodes
    .iter()
    .filter(|n| n.name_unchecked() != node.name_unchecked())
{
    add_gateway_status_label(node_api, other_node.name_unchecked().as_str(), "false")
        .await?;
}
```
You already have a for-each here, you may as well just:
```rust
for other_node in egress_nodes {
    // skip the node we just set to ready
    if other_node.name_unchecked() != node.name_unchecked() {
        add_gateway_status_label(node_api, other_node.name_unchecked().as_str(), "false")
            .await?;
    }
}
```
I see now I already have the current node name. Adjusted as part of egress improvements
eip_operator/src/egress.rs
Outdated
```rust
/// Retrieve all egress nodes in the cluster.
async fn get_egress_nodes(api: &Api<Node>) -> Result<Vec<Node>, kube::Error> {
    let params = ListParams::default().labels(
        format!(
            "{}={}",
            EGRESS_GATEWAY_NODE_SELECTOR_LABEL_KEY, EGRESS_GATEWAY_NODE_SELECTOR_LABEL_VALUE
        )
        .as_str(),
    );

    match api.list(&params).await {
        Ok(node_list) => Ok(node_list.items),
        Err(e) => Err(e),
    }
}
```
We probably don't need this if we're using the node selector on the eip.
Much cleaner, and it removed the need for some const labels in egress improvements.
eip_operator/src/controller/node.rs
Outdated
```rust
// An Unknown ready status could mean the node is unresponsive or experienced a hardware failure.
// A NotReady ready status could mean the node is experiencing a network issue.
_ => {
    // Skip detachment if no EIP is associated with this node.
```
This comment is confusing. Suggested wording: "If an EIP has a resource id label pointing to this node, remove that label, releasing this node's claim to the EIP."
I agree. I changed the wording as part of node improvements
eip_operator/src/controller/node.rs
Outdated
```rust
if let Some(eip) = node_eip {
    warn!(
        "Node {} is in an unresponsive state, detaching EIP {}",
        &name.clone(),
        &eip.name_unchecked()
    );
    crate::eip::set_status_detached(&eip_api, eip).await?;

    return Ok(None);
}
```
If we moved this return out of the `if` and into the match block (with an appropriate comment/debug message) we could avoid having to later check if the node is ready on line 134.
Much cleaner. I was glad to be able to remove the ready_status check from further on in the code. I simplified it in node improvements
Approving, but I think you should change `dbg!` to `debug!` before merging.
eip_operator/src/controller/node.rs
Outdated
```rust
}

dbg!("Node {} is not ready, skipping EIP claim", &name);
```
Should this be `debug!`?
You're right! Just pushed up a fix.
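Worth noting why the swap matters: `dbg!` is std's debugging macro, which doesn't interpolate format strings and always prints to stderr, while `debug!` (assuming the operator logs via `tracing`, as the surrounding `warn!` calls suggest) emits an interpolated, filterable log event. A small standalone illustration:

```rust
use tracing::debug;

fn main() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();

    let name = "node-a";

    // dbg! has no format-string behavior: this would print the literal
    // "Node {} is not ready..." string and &name as separate debug lines,
    // unconditionally, to stderr.
    // dbg!("Node {} is not ready, skipping EIP claim", &name);

    // debug! interpolates and respects the subscriber's level filter.
    debug!("Node {} is not ready, skipping EIP claim", name);
}
```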
Motivation
Updates the eip-operator to handle unresponsive egress nodes. A new label is set on egress nodes indicating their status. New logic is added to disassociate EIPs from unresponsive nodes. A separate function handles cleaning up unresponsive egress nodes to update their status label.
Important: The README.md has been updated to reflect this repo being archived in the near future.
Notes for reviewer
Clippy warnings did not like the nested conditional blocks on match statements. I did some reasonable refactoring to address the warnings; I'm not sure it actually helps readability. What was more important to me was minimizing unnecessary API calls where possible.
Testing
I've tested egress nodes going unresponsive and egress nodes being rolled as well. The code changes match the existing behavior except now egress moves to the new node when the old one goes unresponsive.